Data provenance
===============
Data processes convert input data to output data.  They can contain
simulations, but also data conversion, evaluation, aggregation, and
visualisation.  They should be atomic, i.e. not consist of subprocesses, but
this is not a requirement.
In SciMesh, data processes are of class “Process” just like experimental
processes. You can chain them with “cause” relations, meaning that a data
process has as its potential input all the data produced by its predecessors.
Bulk data
---------
By “bulk data”, we mean an opaque octet stream of data at a specific URL.  In
order to be referred to in a SciMesh graph, the response from the web server
should include a correct content type.  Moreover, the URL must contain the
checksum of the data.  If the URL scheme itself does not provide this
(IPFS URLs, for example, do), the URL fragment (the part after the “``#``”)
must contain a hash using Multiformats_.  In particular, the format is:

.. code-block:: text

   base()

In other words, the binary is encoded by the function “base()”
(e.g. base32), and the character denoting that function (“b” in case of
base32) is prepended. is always the byte 0x01.
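
The following is a minimal sketch of how such a URL fragment could be computed
and checked in Python.  It rests on assumptions that are not stated above: the
hash function is SHA-256, and the encoded binary consists of the byte 0x01
followed by the multihash (function code, digest length, digest).  The function
names ``checksum_fragment`` and ``verify_fragment`` are hypothetical.

.. code-block:: python

   import base64
   import hashlib

   SHA2_256_CODE = 0x12    # multihash function code for SHA-256
   SHA2_256_LENGTH = 0x20  # digest length in bytes (32)


   def checksum_fragment(data: bytes) -> str:
       """Return a multibase (base32) encoded checksum for ``data``."""
       digest = hashlib.sha256(data).digest()
       multihash = bytes([SHA2_256_CODE, SHA2_256_LENGTH]) + digest
       # Assumption: the binary is the byte 0x01 followed by the multihash.
       binary = b"\x01" + multihash
       # Multibase "b" means lowercase RFC 4648 base32 without padding.
       encoded = base64.b32encode(binary).decode("ascii").lower().rstrip("=")
       return "b" + encoded


   def verify_fragment(data: bytes, fragment: str) -> bool:
       """Check downloaded ``data`` against the fragment of its URL."""
       return fragment == checksum_fragment(data)

A bulk data URL would then be built by appending “``#``” and the result of
``checksum_fragment(payload)`` to the plain URL, and a consumer could call
``verify_fragment`` after downloading to detect changed or corrupted data.
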
.. _Multiformats: https://multiformats.io/
Data input
----------
In order to see the exact data that is used, you have to take a closer look
at the process (e.g. by inspecting the inputs manifest in the processing
program).  In SciMesh, URLs to bulk input data are not explicit.  (Of course,
you can make them explicit with your own vocabulary.)  Analogously to physical
samples, the input is the whole graph of processes (and in particular, their
data outputs) that led to this process.
Technically, the program that does the data processing can download any input
data.  In a valid SciMesh graph, however, all of that data is the output of a
preceding process.  Violating this is like omitting sample-influencing
parameters from a physical process.
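
The rule can be illustrated with a small, self-contained Python sketch.  It
does not use any SciMesh API; the dictionary layout, the process names, and the
URLs are hypothetical, and checksum fragments are left out of the URLs for
brevity.  The sketch collects the outputs of all direct and indirect
predecessors of a process and reports any input that none of them produced.

.. code-block:: python

   # Hypothetical in-memory model of a process graph: every process lists its
   # direct predecessors ("cause" relations) and the bulk data URLs it produced.
   processes = {
       "import": {"predecessors": [],
                  "outputs": {"https://example.org/raw.csv"}},
       "clean": {"predecessors": ["import"],
                 "outputs": {"https://example.org/clean.csv"}},
       "plot": {"predecessors": ["clean"],
                "outputs": {"https://example.org/plot.png"}},
   }


   def potential_inputs(name, processes):
       """Return all bulk data URLs produced by the predecessors of ``name``."""
       urls = set()
       seen = set()
       stack = list(processes[name]["predecessors"])
       while stack:
           current = stack.pop()
           if current in seen:
               continue  # already visited via another path
           seen.add(current)
           urls |= processes[current]["outputs"]
           stack.extend(processes[current]["predecessors"])
       return urls


   def undeclared_inputs(name, used_urls, processes):
       """Return the URLs used by ``name`` that no predecessor produced."""
       return set(used_urls) - potential_inputs(name, processes)

For the graph above, ``undeclared_inputs("plot",
{"https://example.org/raw.csv"}, processes)`` is empty, because the raw data
is an output of an indirect predecessor.  A URL that no predecessor produced
would be returned and would indicate an incomplete graph.
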
In some cases, this could mean that you have to create a preceding process just
to connect it with bulk output data URLs.  This is fine.
Data output
-----------
Any data output is represented by URIs that resolve to retrievable URLs with
that data.  These URIs are connected with the process using custom vocabulary
(as is the case with measurement data for experimental processes).  The
process must be the subject of such triples.

.. figure:: bulk_data.*
   :width: 90%
   :name: bulk_data

   Representation of bulk output data in SciMesh.  Here, “sm” is the namespace
   “``http://schema.org/``”.

:numref:`bulk_data` shows